Ensembling and Clustering Approach to Gene Selection
Abstract
In pattern recognition, input variable selection has traditionally focused on technological issues such as performance enhancement, lower computational requirements, and reduced data acquisition costs. In the last few years, however, it has found many applications in basic science as a model selection and discovery technique, as shown by a rich literature on the subject, which reflects the interest in the topic, especially in bioinformatics. A clear example arises from DNA microarray technology, which provides high volumes of data for each single experiment, yielding measurements for hundreds of genes simultaneously. In this paper, we propose a flexible method for analyzing the relevance of input variables in high-dimensional problems with respect to a given dichotomic classification problem. Both linear and non-linear cases are considered. In the linear case, the application of derivative-based saliency yields a commonly adopted ranking criterion. In the non-linear case, the approach is extended by introducing a resampling technique and by clustering the obtained results to stabilize the estimate. The method we propose (see Tab. 1) is termed Random Voronoi Ensemble: it is based on random Voronoi partitions [1], and these partitions are replicated by resampling, so the method actually uses an ensemble of random Voronoi partitions. Within each Voronoi region, a linear classification is performed using Support Vector Machines (SVM) with a linear kernel [4]. To integrate the outcomes of the ensemble, we use the Graded Possibilistic Clustering technique, which ensures an appropriate level of outlier insensitivity [3].
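The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption rather than the authors' procedure: Voronoi centres are drawn at random from the data points, the linear SVM is fitted with a simple batch subgradient solver, the saliency of a feature is the absolute value of its weight (the derivative of the linear decision function w·x + b with respect to that input), and a per-feature median stands in for Graded Possibilistic Clustering, which is not implemented here. All function names and parameters (`fit_linear_svm`, `voronoi_regions`, `random_voronoi_saliency`, `make_toy_data`, `min_per_class`) are hypothetical.

```python
import random
from statistics import median


def fit_linear_svm(X, y, lam=0.01, eta=0.1, iters=100):
    """Linear soft-margin SVM via batch subgradient descent on the hinge loss.

    Labels must be +1 / -1. This is a stand-in for a proper SVM solver.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = [lam * wj for wj in w], 0.0  # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # margin violated: hinge term is active
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
    return w, b


def voronoi_regions(X, k, rng):
    """Assign every sample to the nearest of k centres drawn at random from X."""
    centres = rng.sample(X, k)
    regions = [[] for _ in range(k)]
    for i, xi in enumerate(X):
        dists = [sum((a - c) ** 2 for a, c in zip(xi, ctr)) for ctr in centres]
        regions[dists.index(min(dists))].append(i)
    return regions


def random_voronoi_saliency(X, y, k=2, replicates=20, min_per_class=5, seed=0):
    """Rank features by |w| of local linear SVMs over resampled partitions."""
    rng = random.Random(seed)
    d = len(X[0])
    per_region = []
    for _ in range(replicates):  # ensemble of random Voronoi partitions
        for idx in voronoi_regions(X, k, rng):
            labels = [y[i] for i in idx]
            if min(labels.count(1), labels.count(-1)) < min_per_class:
                continue  # skip regions with too few samples of one class
            w, _ = fit_linear_svm([X[i] for i in idx], labels)
            scale = max(abs(wj) for wj in w) or 1.0
            per_region.append([abs(wj) / scale for wj in w])
    if not per_region:
        raise ValueError("no region contained enough samples of both classes")
    # Assumption: a per-feature median replaces Graded Possibilistic Clustering.
    saliency = [median(col) for col in zip(*per_region)]
    ranking = sorted(range(d), key=lambda j: -saliency[j])
    return saliency, ranking


def make_toy_data(n=200, d=4, seed=1):
    """Synthetic data: feature 0 carries the class signal, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        noise = [rng.gauss(0, 1) for _ in range(d - 1)]
        X.append([2.0 * label + rng.gauss(0, 1)] + noise)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    saliency, ranking = random_voronoi_saliency(X, y)
    print("saliency:", [round(s, 3) for s in saliency])
    print("ranking (most relevant first):", ranking)
```

The median is used only because it is a simple outlier-insensitive aggregate, which echoes the stated purpose of the clustering step. On real expression data a library SVM solver and a more careful treatment of class balance within each region would likely be preferable.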
Similar resources
Improving Accuracy in Intrusion Detection Systems Using Classifier Ensemble and Clustering
Recently, with the development of technology, the number of network-based services is increasing, and sensitive user information is shared through the Internet. Accordingly, large-scale malicious attacks on computer networks could cause severe disruption to network services, so cybersecurity has become a major concern for networks. An intrusion detection system (IDS) could be cons...
Ranking Pharmaceutics Industry Using SD-Heuristics Approach
In recent years, the stock exchange has become one of the most attractive and growing businesses with respect to investment and profitability. However, applying a scientific approach in this field is difficult because of the variety and complexity of the decision-making factors involved. This paper tries to deliver a new solution for portfolio selection based on multi criteria decision making literatu...
A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset
Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to the weights of attributes is proposed. A comprehensive review of electrocardiogram signal classification and segmentation methodologies indicates that algorithms able to effectively handle the nonstationarity and uncertainty of the signals should be used for ECG analysis. Ex...
Testing Several Rival Models Using the Extension of Vuong's Test and Quasi Clustering
The two main goals in model selection are, first, introducing an approach to test the homogeneity of several rival models and, second, selecting a set of reasonable models or estimating the rival model closest to the true one. In this paper we extend Vuong's method for several models to cluster them. Based on the working paper of Katayama (2008), we propose an approach to test whether rival models h...
SFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy
In this paper, we propose a new gene selection algorithm based on Shuffled Frog Leaping Algorithm that is called SFLA-FS. The proposed algorithm is used for improving cancer classification accuracy. Most of the biological datasets such as cancer datasets have a large number of genes and few samples. However, most of these genes are not usable in some tasks for example in cancer classification....